
Journal of Biomedical Informatics

Elsevier BV

Preprints posted in the last 7 days, ranked by how well they match Journal of Biomedical Informatics's content profile, based on 45 papers previously published here. The average preprint has a 0.07% match score for this journal, so anything above that is already an above-average fit.

1
A Heterogeneous Graph Neural Network Framework for Multi-Horizon Stroke Mortality Prediction

Tharzeen, A.; Vafaei Sadr, A.; Radfar, N.; Hwang, W.; Abedi, V.; Zand, R.

2026-06-10 health informatics 10.64898/2026.06.09.26355176 medRxiv
Top 0.1%
28.7%

Background: Machine learning models for stroke mortality prediction typically treat each time horizon independently and use flat tabular features that ignore the relational structure of electronic health records (EHRs). In this pilot study, we leveraged graph-based machine learning models to predict post-stroke all-cause mortality across three different time horizons. Methods: We developed Stroke Temporal Heterogeneous Graph (StrokeTHG), a heterogeneous graph neural network model for simultaneous multi-horizon stroke mortality prediction (30-day, 90-day, 1-year) using EHR data from Penn State Health System. The model encodes various relations among EHR entities (e.g., patient, diagnosis, comorbidity) and temporal encoding of admission time to better predict stroke mortality. We compared our proposed approach against various baseline methods, including Logistic Regression, Random Forest, and XGBoost. We also performed ablation and subgroup analyses, evaluated the quality of learned graph embeddings, and assessed the importance of different edge types in the graph. Results: We included 4,144 stroke patients (mean age 69.2 years; 54.3% men), of whom 3,332 (80.4%) survived their stroke after one year. 30-day, 90-day, and 1-year mortality rates were 9.7%, 13.7%, and 19.6%, respectively. Our proposed approach, StrokeTHG, achieved AUROC of 0.872, 0.878, and 0.837 across horizons, outperforming all tabular baselines. At ≥75% specificity, the model identified 5-10 percentage points more mortality cases than the best baseline at each horizon. Subgroup analysis demonstrated consistent performance across sex subgroups and the largest discriminative gains in the Age 65-80 stratum. Edge-type ablation identified phenotype-patient and admission-patient edges in the constructed EHR graph as the most influential relational edges for mortality prediction.
StrokeTHG embeddings outperformed all graph and matrix factorization baselines under an identical downstream classifier, confirming that performance gains stem from representation quality rather than classifier capacity. Conclusions: StrokeTHG demonstrates that heterogeneous graph representations of EHR data provide a consistent improvement over flat tabular models for multi-horizon stroke mortality prediction, with particular advantage at clinically actionable sensitivity thresholds and novel multi-horizon monotonic prediction capability. This methodological framework may be adaptable to other EHR-based clinical research studies seeking to leverage heterogeneous relational structures for predictive modeling.

2
Don't stop the heart: a performance analysis of large language models and potassium dosing

Blotske, K.; Zhao, X.; Henry, K.; Murray, B.; Gao, Y.; Smith, S. E.; Wayne, N.; Ku, P.; Smith, B.; Moua, S.; Sikora, A.

2026-06-04 pharmacology and therapeutics 10.64898/2026.06.02.26354762 medRxiv
Top 0.2%
6.9%

Background: Electrolyte replacement is ubiquitous in the acute care setting, but its familiarity cannot belie that even small dosing errors with potassium can cause lethal cardiac arrhythmias. Recently, MedAgentBench offered a benchmark for agentic artificial intelligence (AI) including the ability to correctly dose potassium based on a single rule; however, this does not adequately reflect the clinical complexity or safety concerns of an agent that has been used as the lethal injection. The purpose of this analysis was to probe leaderboard large language model (LLM) capabilities to follow basic dosing rules to safely replace potassium in a series of clinician-annotated cases. Methods: Using a clinician panel, we developed a series of dosing principles and 20 clinical cases reflective of the complexity of potassium replacement. External clinicians were surveyed to assess practice variability and agreement with clinician panel answers. We tested GPT-5-Chat with each case in triplicate, with and without the clinician-curated dosing principles, and prompted the model to answer six questions involving potassium goals, dosing, route, lab frequency, concurrent interventions, and the model's perceived level of confidence for the output and complexity of the case. The primary outcome was the rate of appropriate recommendations in comparison to clinician answers. Results: A total of 54 clinicians reviewed the 20 hypokalemia cases and hypokalemia dosing guideline. Clinicians expressed "highly agree" or "somewhat agree" for 66.8% of the cases evaluated when asked if they agree with the guideline-recommended management. When given the potassium dosing guideline, total errors dropped from 165 to 104, and average accuracy improved from 45% to 65% with GPT-5-Chat. GPT-5-Chat conveyed a high level of confidence for 100% of responses, while labeling 80% and 76% of cases as highly complex with and without the criteria, respectively.
Potential harm scores were considerable in both groups; however, a notable reduction in severity scores occurred with the dosing guidance document. Recommendations on concurrent interventions and dosing had the highest rate of errors in both groups. Conclusions: Benchmarks must appropriately reflect clinical complexity to be considered valuable for the deployment of agentic artificial intelligence tools in the healthcare domain. GPT-5-Chat assessment on a comprehensive medication management task for potassium replacement showed improvement with dosing guidance, yet unfit benchmarking performance.

3
Clinician-Centered Evaluation of Large Language Model-Generated Discharge Summaries for Longer Hospitalizations: Insights from Hospitalists and Primary Care Physicians

Osborne, T.; Mahmud, T.; Zheng, X.; Jampala, S.; Abbasi, S.; Hong, S.; Kranz, K.; Lee, S.; Ng, P.; Odekon, K.; Schachter, L.; Sexton, R.; Spinnato, T.; Tharakan, M.; Wu, Z.; Wang, F.; Wong, R.

2026-06-05 health systems and quality improvement 10.64898/2026.06.03.26354858 medRxiv
Top 0.2%
6.4%

Although large language models (LLMs) have shown promise for discharge summary generation, their value may be greater in longer hospitalizations, where increasing documentation volume and complexity increase both clinician burden and the risk of communication failures during transitions of care. Prior evaluations of LLM-generated discharge summaries have largely involved shorter stays and have rarely examined receiving-clinician priorities or incidental finding reporting. We compared LLM-generated and human-authored discharge summaries for 60 Internal Medicine hospitalizations lasting 7 to 21 days, with paired assessment by hospitalists and primary care physicians (PCPs). Clinician reviewers preferred LLM-generated summaries for 95% of encounters and rated them higher for quality, readability, factuality and completeness. PCPs, the primary recipients responsible for post-discharge care, found that LLM-generated summaries were better for understanding and communicating hospital care to patients, and providing follow-up care. LLM-generated summaries had fewer annotated errors, primarily due to fewer omissions, without increased estimated harm potential or likelihood compared with human-authored summaries. Benefits of LLM-generated summaries were especially salient for PCPs, who identified more omissions with greater downstream likelihood of harm than hospitalists. This underscores the importance of designing transition documents around the needs of clinicians assuming care post-discharge. LLM identification of radiology incidental findings was generally accurate and appropriate, suggesting potential to improve follow-up of clinically relevant findings. These findings extend prior work by demonstrating clinical value of LLMs in summarizing longer, complex hospitalizations and highlighting the value of stakeholder-centered design in clinical AI systems. 
Together, they support supervised LLM-assisted discharge summarization as a tool to reduce cognitive burden, improve documentation quality, and enhance transition-of-care communication.

4
EXHEART: A Fairness-Aware Explainable Stacked Ensemble for Cardiovascular Disease Classification with Cross-Instrument Disparity Attribution

Biswas, M. A.; Laila, A.

2026-06-05 health informatics 10.64898/2026.06.03.26354879 medRxiv
Top 0.3%
4.7%

Background: Machine learning models trained on population health surveys offer scalable tools for cardiovascular screening, but recurring methodological weaknesses undermine their credibility and equity: data leakage from synthetic oversampling, qualitative rather than quantitative explainability evaluation, and the absence of demographic fairness auditing at the clinical operating threshold. Methods: We present EXHEART, a leakage-free stacked ensemble pipeline trained on BRFSS 2015 (n = 253,680) and validated on BRFSS 2020 (n = 319,795; temporal transport and retrain) and a clinical cardiovascular examination dataset (n = 68,730). The pipeline combines XGBoost, LightGBM, Random Forest, and a multi-layer perceptron as base learners with 5-fold out-of-fold logistic regression stacking and Platt scaling calibration. A quantitative SHAP-LIME consistency framework, based on Kendall-tau rank correlation and Jaccard overlap, accompanies a decision-curve analysis, a subgroup-stratified SHAP interaction analysis, and an intersectional fairness audit (Sex x Age x Income) with threshold-shifting mitigation and a frontier of the fairness-utility trade-off. The framework also adds cross-instrument fairness-disparity attribution, an empirical diagnostic that provides evidence on whether an observed subgroup disparity is more consistent with a measurement-induced or a substantive explanation by re-validating it on a dataset that measures the same clinical construct objectively. On heart disease, this diagnostic associates 89% of the sex TPR gap (95% CI [0.65, 0.99]) with the self-reported survey outcome rather than with a substantive risk difference. Results: On BRFSS 2015, EXHEART achieves AUC-ROC = 0.850, AUPRC = 0.371, Brier score = 0.071, and reduces ECE by 96% (0.256 to 0.011) via Platt scaling. 
Global SHAP-LIME rank agreement is moderate-to-strong (Kendall-tau = 0.580, Spearman-rho = 0.818) with a substantial top-3 divergence (Jaccard@3 = 0.200), where Stroke flips from SHAP rank 8 to LIME rank 1. The Sex TPR gap is 0.124 at the screening threshold; intersectional Sex x Age disparities reach 0.649 among adequately-powered cells, 5.2x the single-attribute gap. Temporal transport to BRFSS 2020 collapses sensitivity from 0.776 to 0.267, while retraining restores AUC = 0.840 and ECE = 0.012. On clinical examination data, the Sex TPR gap collapses to 0.014; the attribution test indicates this gap is instrument-dependent, consistent with a measurement or outcome-definition explanation rather than a substantive risk difference. Cross-domain SHAP analysis identifies four instrument-independent CVD risk factors and two major portability failures. Conclusions: EXHEART combines three practices that population-scale cardiovascular classifiers usually apply in isolation: leakage-free training with calibrated probabilities, a test of whether the model's explanations are stable, and a fairness audit that examines intersecting subgroups rather than single attributes. Bringing them together proved worthwhile. The intersectional audit revealed disparities that single-attribute auditing missed, and the cross-instrument comparison indicated that much of the sex gap reflects how the outcome is measured in survey data rather than a substantive difference in risk. The temporal transport findings indicate that deployed BRFSS models warrant periodic monitoring and retraining to maintain clinical utility. EXHEART is a retrospective methodological evaluation on public de-identified data; it is not validated for direct clinical decision-making, diagnosis, or treatment recommendation without prospective clinical validation.
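As an illustration of the rank-agreement metrics named in the EXHEART abstract (Kendall-tau and Jaccard overlap of the top-k features), the following is a minimal sketch only. The feature names and rankings are hypothetical toy values, not taken from the preprint, and the tie-free tau-a variant used here is an assumption; the abstract does not say which variant the authors computed.

```python
from itertools import combinations


def kendall_tau(rank_a, rank_b):
    """Kendall tau-a between two {feature: rank} dicts (assumes no tied ranks)."""
    features = sorted(rank_a)
    concordant = discordant = 0
    for f, g in combinations(features, 2):
        # A pair is concordant when both rankings order f and g the same way.
        sign = (rank_a[f] - rank_a[g]) * (rank_b[f] - rank_b[g])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n_pairs = len(features) * (len(features) - 1) // 2
    return (concordant - discordant) / n_pairs


def jaccard_at_k(rank_a, rank_b, k):
    """Jaccard overlap of the top-k feature sets of two rankings."""
    top_a = {f for f, r in rank_a.items() if r <= k}
    top_b = {f for f, r in rank_b.items() if r <= k}
    return len(top_a & top_b) / len(top_a | top_b)


# Hypothetical rankings (1 = most important) from two explanation methods.
shap_rank = {"f1": 1, "f2": 2, "f3": 3, "f4": 4, "f5": 5}
lime_rank = {"f1": 1, "f4": 2, "f5": 3, "f2": 4, "f3": 5}

tau = kendall_tau(shap_rank, lime_rank)
overlap = jaccard_at_k(shap_rank, lime_rank, k=3)
```

In this toy case the two top-3 sets share a single feature out of five distinct ones, which is the situation a Jaccard@3 of 0.200 describes.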

5
Characterizing Documented Psychosocial Stressors in Pediatric Psychiatric Emergencies with an Open-Weight Large Language Model

Hartlage, C. S.; Manning, E. R.; Bernard, J.; Vaish, S.; Gray, J.; Young, M.; Pestian, T.; Folger, A. T.; Tachinardi, P.; Mendonca, E. A.; Brokamp, C.

2026-06-09 health informatics 10.64898/2026.06.08.26354931 medRxiv
Top 0.3%
4.7%

Objective: To evaluate whether a locally hosted open-weight large language model (LLM) can extract documented psychosocial factors from pediatric psychiatric intake notes and apply validated extraction to a large emergency psychiatry cohort. Materials and Methods: We identified emergency department presentations at Cincinnati Children's Hospital Medical Center from January 1, 2016, through December 31, 2024, among patients younger than 18 years with psychiatric billing diagnoses. Using full-text intake notes, gpt-oss:120b classified peer conflict, sleep disruption, and school-related academic, attendance, and disciplinary issues as detected, negated, or indeterminate. Four human raters independently reviewed 50 notes. We compared Fleiss' kappa among humans alone versus humans plus the LLM, assessed repeated-query stability across 50 independent calls per note, and applied the workflow to all eligible notes. Results: Among 37,315 eligible admissions, 22,284 had eligible intake notes; 22,270 produced parseable JSON. In detected-versus-not-detected coding, human-plus-LLM reliability did not differ significantly from human-only reliability across measures (human κ 0.71-0.94; human-plus-LLM κ 0.70-0.93). Stability was associated with human agreement: mean LLM-human agreement increased from 42.6% for classifications with less than 80% stability to 82.7% for classifications with 100% stability (Pearson r = 0.36). Full-cohort extraction showed frequent and overlapping documented factors: sleep disruption was most frequently detected (57.7%), followed by peer conflict (47.2%), academic issues (43.4%), disciplinary issues (43.3%), and attendance issues (16.9%). Discussion: Agreement varied by construct and was strongest when repeated model outputs were stable. Conclusion: Locally hosted open-weight LLMs can support scalable structured extraction of documented psychosocial factors from pediatric psychiatric intake notes after local validation.
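One plausible way to quantify the repeated-query stability mentioned in this abstract is the share of repeated calls that return the most common label. The abstract does not give its exact definition, so that definition, the function name, and the sample labels below are all assumptions made for illustration.

```python
from collections import Counter


def stability(labels):
    """Return (modal label, fraction of calls that returned it)."""
    counts = Counter(labels)
    modal_label, modal_count = counts.most_common(1)[0]
    return modal_label, modal_count / len(labels)


# Hypothetical outputs of 50 independent calls for one note and one construct.
calls = ["detected"] * 45 + ["indeterminate"] * 5
label, score = stability(calls)
```

Under this definition, a classification with 100% stability is one where all 50 calls return the same label.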

6
Genosolver: Rare Disease Diagnosis through Holistic Integration of Unstructured Clinical Narratives Using Large Language and Reasoning Models

Islam, T.; Danner, M.; Ziad, Z.; Begemann, M.; Beijer, D.; Lischka, A.; Lausberg, E.; Mattern, L.; Suh, J.; Wittig, P.; Guezel, N.; Schlaich, E.; Karaivanova, R.; D'Augello, S.; Franken, L.; Ruedebusch, J.; Mueller, R.; Perchalla, E.; Zempel, H.; Haag, N.; Eggermann, K.; Eggermann, T.; Meyer, R.; Kraft, F.; Elbracht, M.; Kurth, I.; Krause, J.

2026-06-05 health informatics 10.64898/2026.06.04.26354845 medRxiv
Top 0.3%
4.3%

Background: Molecular medicine has made genetic diagnostics crucial for rare diseases, but the majority of patients remain without diagnosis even after state-of-the-art assessment. Standardized systems for integrating clinical features, such as the Human Phenotype Ontology (HPO), offer assistance, but are often insufficiently detailed and fail to capture crucial clinical parameters such as age at onset, longitudinal changes in symptoms, detailed characteristics of a clinical symptom, or the absence of a feature. Results: We present Genosolver, an integrated workflow that utilizes machine learning to address this bottleneck. Using Large Language Models (LLMs) and Large Reasoning Models (LRMs) on unstructured clinical notes and electronic health care data, we generate a workflow that unifies phenotype extraction, generates differential diagnosis, and prioritizes genetic variants from genome data. We evaluated the performance on 233 previously genetically solved cases, where Genosolver ranked the causative gene first in 72% of cases and in 94% of cases in the top 10 gene list, outperforming the existing benchmarking tool Exomiser by 9%. Semi-automated reanalysis of 1,875 unsolved rare disease cases yielded an additional diagnostic rate of 1.7%. Incorporating rich, unstandardized clinical narratives substantially enhanced model performance beyond HPO-only inputs and demonstrated competitive results using data security compliant local models. Conclusion: Integrating unstandardized clinical data with local LLMs and reasoning offers a scalable, data-secure workflow that increases molecular diagnoses in rare diseases.

7
A Data-Driven Framework for Generating Population-Linked Case Vignettes from Nationwide Triage Data

Seidel, A.; Steiger, E.; Schuster, J.; Kroll, L. E.

2026-06-10 health informatics 10.64898/2026.06.08.26354886 medRxiv
Top 0.4%
4.0%

Background: Digital decision-support tools such as triage systems and symptom checkers support millions of health-related decisions each year. Their quality and safety are commonly evaluated using textual patient cases, known as case vignettes. However, existing vignette sets written by medical experts cover only a limited spectrum of real-world patient presentations and lack population weights, which would allow extrapolating evaluation results to the underlying patient population. Objective: This study aims to develop a data-driven framework for automatically generating a human-manageable set of case vignettes from nationwide triage data that captures broad presentation diversity and links each vignette to a quantitative weight reflecting the number of underlying patient assessments. Methods: From 3.2 million triage assessments conducted over one year using structured triage software in the German medical on-call service (telephone triage and online self-triage) and at the joint contact points of the outpatient emergency care service and hospital emergency departments, we randomly sampled 50,000 cases. Triage questionnaires were converted into semantic embeddings using a German Sentence Transformer Model and grouped by agglomerative clustering. For clusters containing sufficient assessments, we generated one representative assessment using a two-phase simulated-annealing optimization. The optimization minimized the distance to the cluster centroid while maximizing the number of answered triage questions, aiming for high representativeness and information content. Each representative assessment was assigned the size of its source cluster as its sample-based weight. A similarity-based sensitivity analysis was performed to examine whether these weights were preserved in the full 1-year population. 
Finally, the question-answer pairs of the representative assessments were converted into structured textual case vignettes using controlled prompting of a large language model. Results: The cluster analysis yielded 514 included clusters covering 96.8% of the sampled 50,000 assessments. The generated representatives showed strong agreement with the majority treatment-urgency recommendation of their source cluster (Spearman's ρ=0.78, p<0.001) and contained on average 4.3 more answered triage questions than the original assessments within their clusters. When weighted by cluster size, the representatives approximated the sample distributions of treatment urgency, demographics, and symptoms, although some systematic deviations remained, most notably an overrepresentation of female cases (+13.5%), patients aged 14-49 years (+8.0%), and the urgency category "As soon as possible" (+6.6%). Of 121 recorded symptoms, 101 (83.5%) were covered by the representatives; the rest each occurred in <0.5% of the sample. In a sensitivity analysis, cluster-based vignette weights were strongly correlated with similarity-based population weights (Spearman's ρ=0.77, p<0.001), and 90.1% of assessments in the full 1-year population were matched to at least one vignette. Conclusions: We present a data-driven framework for deriving a manageable set of population-weighted case vignettes from nationwide triage data. The resulting vignettes captured broad presentation diversity, approximated key sample characteristics, and provided an explicit quantitative link to the number of underlying patient assessments. After medical expert review and refinement, the vignettes may support more population-aware evaluation and quality assurance of digital decision-support tools.
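The weighting step described in this abstract (each representative receives the size of its source cluster, and only sufficiently large clusters are included) can be sketched roughly as below. The minimum-size threshold, the function name, and the toy cluster labels are hypothetical; the abstract does not state the actual inclusion cut-off.

```python
from collections import Counter


def vignette_weights(cluster_labels, min_size):
    """Map each included cluster to its size and report the share of cases covered."""
    sizes = Counter(cluster_labels)
    # Keep only clusters with enough assessments to generate a representative.
    weights = {c: n for c, n in sizes.items() if n >= min_size}
    coverage = sum(weights.values()) / len(cluster_labels)
    return weights, coverage


# Hypothetical cluster assignment for 10 sampled assessments.
labels = [0] * 5 + [1] * 3 + [2] + [3]
weights, coverage = vignette_weights(labels, min_size=2)
```

Here the two singleton clusters are excluded, so the weighted vignettes cover 80% of the toy sample, analogous to the 96.8% coverage reported for the 514 included clusters.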

8
Study Design Indexing in Transition: A Focused Comparison of manual NLM Indexing vs. Transformer-based Automated Models

Das, P.; Schneider, J.; Mayo-Wilson, E.; Kilicoglu, H.; Menke, J. D.; Nam, D.; Ninan, K.; Oberste, J.-P.; Troy, A. M.; Ying, X.; Holt, A. W.; Smalheiser, N. R.

2026-06-04 health informatics 10.64898/2026.06.03.26354854 medRxiv
Top 0.4%
4.0%

Objectives: Study design indexing of biomedical publications is crucial for evidence retrieval and synthesis. We sought to evaluate the accuracy and suitability of a transformer-based model (TM) for indexing clinical study designs, in comparison to National Library of Medicine (NLM) indexing. However, this is challenging for at least three reasons: First, to date, all automated systems have been trained and evaluated on manual NLM indexing assignments, which are themselves subject to errors. Second, TM's probabilistic predictive scores take into account uncertainty, and can be converted to TRUE/FALSE assignments in different ways depending on the needs of users, while NLM labels are categorical. Third, our goal (to tag only articles that exhibit a given design) differs from that of NLM, which tags articles that discuss that design as well as articles that exhibit it. Materials and Methods: Therefore, we carried out a limited evaluation of the TM model that focuses only on the articles that received the most confident predictions, that is, the highest scores that are almost certainly TRUE and the lowest scores that are almost certainly FALSE, but which disagreed with NLM assignments. This was performed both for articles published in 2016 (when NLM decisions were manual) and in 2025 (when NLM decisions were automated). To establish ground truth, dual annotators indexed the articles independently, following written definitions, for four prominent study designs--cohort, case-control, cross-sectional, and case report. Results: For three designs (case-control, case report, cross-sectional), the articles having the top 100 predictive TM scores (when NLM failed to assign that design) were judged to exhibit that design in the great majority (86-100%) of cases. Conversely, the articles having the lowest 100 predictive TM scores (when NLM did assign the study design) exhibited the design in relatively few (0-21%) cases.
The most confident predictions of the TM model were highly accurate and not redundant with automated NLM indexing; the exception was cohort study articles, in which both TM and NLM labels showed high error rates of both omission and commission. Discussion and Conclusion: TM may have value for identifying articles exhibiting study designs, which is especially important for clinical decision-making as well as systematic reviews and other evidence syntheses. NLM indexing of cohort studies cannot be regarded as a reliable gold standard for training or evaluation of automated systems, warranting efforts to create a new manually annotated corpus.

9
General-purpose large language models can achieve physician-level accuracy in complex medical data extraction

Rajeev, M.; Narayan, A.

2026-06-10 gastroenterology 10.64898/2026.06.06.26354838 medRxiv
Top 0.4%
4.0%

Background: Unstructured data represent about 80% of total electronic health records (EHR) data. Structuring this free text is essential for advancing clinical research, including cohort selection for trials, retrospective studies, and the development of disease registries. While manual chart review (MCR) remains the gold standard for extracting this clinical data, the process is inherently slow, resource-intensive, and susceptible to errors from human fatigue. We evaluated the extraction accuracy, safety, and efficiency of the HeLIX (Hepatology Logic-Integrated Extraction) framework, a Large Language Model (LLM) protocol using Google Gemini 3 Pro, compared to a gold-standard Manual Chart Review (MCR). Methods: A prospective validation study was conducted using 50 high-complexity, simulated hepatology discharge summaries designed to replicate the real-world heterogeneity of EHRs. The HeLIX framework employed a Zero-Shot, Structured Chain-of-Thought (CoT) prompting strategy enforced by a three-layer architecture: Clinical Reasoning Trace, Schema Enforcement, and Evidence Verification. The model extracted 45 distinct clinical variables. Performance was benchmarked against a consensus MCR. Results: Across 2,250 evaluated data points, the model achieved an overall Extraction Accuracy of 99.24% (95% CI: 98.8%-99.5%), with perfect concordance in 35/45 (77.8%) variables. For binary diagnostic variables, the model demonstrated an overall F1-score of 0.98, Recall of 0.99 and substantial inter-rater reliability (Cohen's κ = 0.97). Hallucinations were exceptionally rare (2/2250; 0.08%). Critical errors affecting clinical management occurred in only 2 instances (<0.1% of total data), both involving etiological misattribution in complex multifactorial diagnoses. The AI workflow was 13.4-fold faster and 95.1% more cost-effective than manual extraction.
Conclusion: The HeLIX framework demonstrates physician-level accuracy and reliability in extracting complex hepatology data. It offers a scalable, efficient, and economical alternative to manual chart review. Such frameworks could accelerate clinical research, enabling healthcare systems globally to build comprehensive patient registries for a fraction of the traditional cost.
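For readers who want to sanity-check the reported accuracy interval, here is a small sketch of a Wilson score interval for a binomial proportion. Two things are assumptions rather than statements from the abstract: the count of 2,233 correct data points is back-calculated from 99.24% of 2,250, and the Wilson method is one common choice, since the abstract does not say how its interval was computed.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 for ~95%)."""
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half_width, center + half_width


# 2,233 is inferred from the reported 99.24% of 2,250 (an assumption).
lo, hi = wilson_interval(2233, 2250)
```

Under these assumptions the interval rounds to roughly 98.8%-99.5%, which is consistent with the range quoted in the abstract.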

10
Prototyping a Generative AI-powered Person-centered Digital Health Tool to Mitigate Risk of Preventable Adverse Drug Events

Dobbins, D.; Russell, A.; Gunther, M.; Shetty, V.; Shomali, A.; Vawdrey, D.; Waring, S.; Whary, P.; Wong, J.; Wright, E. A.; Olson, A. W.

2026-06-04 health systems and quality improvement 10.64898/2026.06.02.26354712 medRxiv
Top 0.5%
3.6%

Objectives: Older adults with comorbidities and polypharmacy have disproportionately high risk of hospitalization as well as readmission from adverse drug events (ADEs), of which 28%-71% are preventable (pADEs). This paper introduces an LLM application, CommunicADE, designed to support risk-mitigation of pADE-related readmission for the aforementioned population. We aim to evaluate CommunicADE's technical performance with OpenAI's HealthBench criteria: accuracy, completeness, communication quality, context awareness, and instruction following. Materials and Methods: Our technical validation study used an LLM (KimiK2.5) to simulate interviews between CommunicADE and nine high-fidelity synthetic patients hospitalized and at increased risk for pADE-related readmission (65+ years, comorbidities, 5+ medications). Some pADE risk mechanism clues were visible to CommunicADE in patient H&Ps, but most mechanisms were solely discoverable in interviews. Two pharmacists evaluated CommunicADE's interview questions and EHR notes with HealthBench-informed variables. Analyses used descriptive statistics. Results: For 35 mechanisms across 9 patients (avg=3.89 mechanisms/patient), CommunicADE's precision and recall were 0.92 and 0.63, respectively. Hallucinations were absent. Coherence and person-centeredness scored 4.28 and 4.44 on a 5-point scale (5=highest). On average, communication was at a 5th grade level and objective for 78% of patients. Most patient-reported quotes included in notes (92%) supported detected mechanisms. CommunicADE followed all instructions regarding interview length and patient approvals. Discussion: CommunicADE's strongest performance was in accuracy (precision, hallucinations), communication quality (coherence, readability), context awareness (person-centeredness). Completeness (recall) and instruction following (objectivity, pADE mechanism/quote alignment) show room for improvement.
Conclusion: Findings suggest technical readiness for a feasibility pilot with real-world patients, and key areas for performance improvement.

11
Development and Prospective Validation of Predictive Model for Early Hemodynamic Deterioration in Critical Care: A Multicenter Study

Nagori, A.; Singh, P.; Firdos, S.; Devadiga, A.; Vats, V.; Gupta, A.; Bandhey, H.; Ailavadi, P.; Awasthi, R.; Narotam, N.; Mishra, A.; Lodha, R.; Sethi, T.

2026-06-10 intensive care and critical care medicine 10.64898/2026.06.05.26353765 medRxiv
Top 0.5%
3.5%

High-frequency physiological monitoring in ICUs can identify impending deterioration hours before clinical recognition, yet extracting reliable early-warning signals from noisy vital-sign streams remains challenging. We present SIgnose, an interpretable prediction framework for early detection of abnormal shock index (SI), built from routinely monitored vital signs using physiologic variability and nonlinear time-series features. SIgnose was developed on the eICU Collaborative Research Database and externally validated on the MIMIC-III adult database and a pediatric SafeICU cohort (AIIMS New Delhi), with additional prospective validation in the pediatric ICU. We benchmarked three representation strategies: (i) engineered physiologic variability and nonlinear time-series features, (ii) deep learning, and (iii) Llama-3.1-8B embeddings with low-rank adaptation. Physiologic variability features consistently demonstrated superior cross-cohort generalization. The final model used 3,970 features from five vital signs to predict abnormal SI up to 8 hours ahead, achieving AUROC 0.861 (95% CI 0.859-0.863) and AUPRC 0.927 (95% CI 0.925-0.929) on eICU. External validation yielded AUROC 0.870 (95% CI 0.863-0.876) and AUPRC 0.935 (95% CI 0.930-0.940) on MIMIC-III, and AUROC 0.875 (95% CI 0.863-0.888) and AUPRC 0.915 (95% CI 0.898-0.930) on SafeICU; prospective pediatric validation (n = 88) achieved AUROC 0.885 (95% CI 0.868-0.902) and AUPRC 0.911 (95% CI 0.882-0.936). SHAP interpretability analysis identified heart rate variability, respiratory trend dynamics, and multi-scale blood pressure variability as key early-warning signatures. These findings establish SIgnose as a reproducible, low-compute, early-warning framework and demonstrate that physiologic variability features provide robust, generalizable representations for early deterioration detection across adult and pediatric critical care.

12
A hierarchical clinical fusion transformer model for personalized opioid treatment: Development and validation in diabetic surgical patients

Naderalvojoud, B.; Sutjiadi, B. J.; Koul, A.; Curtin, C.; Gevaert, O.; Hernandez-Boussard, T.

2026-06-08 health informatics 10.64898/2026.06.04.26353331 medRxiv
Top 0.5%
3.5%

Background: Machine learning (ML) models are increasingly used to predict adverse outcomes after surgery. However, most rely on static patient characteristics (e.g., age, comorbidities) and overlook clinician-controlled treatment decisions that can be actively modified at the point of care. Discharge opioid prescribing is a key modifiable, clinician-controlled decision, yet optimizing prescribing choices across multiple adverse outcomes remains underexplored in predictive modeling. This study addresses that gap by introducing a novel ML framework that explicitly separates fixed patient risk factors from modifiable prescribing options to support personalized, risk-informed opioid prescribing decisions. Methods: We developed the Hierarchical Clinical Fusion Transformer (HCF-Transformer), an ML model designed to estimate patient-specific risks across four postoperative outcomes: prolonged opioid use (POU), chronic pain (CP), 30-day readmission, and opioid-associated outcomes (OAO). The model constructs patient risk profiles from fixed, non-modifiable baseline factors, followed by a transformer layer. Clinician-controllable discharge opioid regimens are modeled as alternative intervention candidates and fused with the fixed risk representation through a clinical fusion mechanism, enabling assessment and ranking based on predicted risks. A Total Relative Risk (TRR) metric, calibrated to each outcome prediction threshold, guides the recommendation process. We evaluated the model in diabetic surgical patients, a common high-risk population. Results: The study included 157,853 unique diabetic surgical patients, with outcome prevalences ranging from 47.2% (POU) to 1.8% (OAO). The HCF-Transformer achieved the highest AUROCs, 0.798 for POU, 0.712 for 30-day readmission, 0.808 for CP, and 0.922 for OAO, outperforming Random Forest, FT-Transformer, and ResNet-based models.
Compared to these baselines, HCF-Transformer generated more stable and discriminative risk estimates and demonstrated significant variation in TRR scores across discharge opioid options (ANOVA p < .01, eta-squared > .01). This enabled consistent identification of lower-risk regimens tailored to patient-specific profiles. Conclusions The HCF-Transformer introduces a novel hierarchical fusion approach to optimize opioid prescribing by integrating static patient risk profiles with modifiable discharge options. Using transformer-based modeling and a quantifiable TRR metric, the model delivers personalized, risk-aware recommendations. This approach enables data-driven opioid prescribing tailored to individual risk and has the potential to improve postoperative outcomes in high-risk populations. Our findings demonstrate that integrating modifiable factors with structured risk profiles through a transformer-based fusion architecture can enhance decision-support systems, paving the way for more actionable and personalized AI in healthcare.

13
When Algorithms Prescribe: A Cross-Sectional Study of Quality, Misinformation, and Engagement in Statin-Related Content on TikTok

Gharibyan, I.; Ahner, E.; Shao, R.; Sharma, D.; Navarsartian Tazehkand, T.; Diep, J.; Assoumou, B.

2026-06-08 health informatics 10.64898/2026.06.04.26354962 medRxiv
Top 0.5%
3.5%

Background: Statins are key to preventing atherosclerotic cardiovascular disease and lowering low-density lipoprotein cholesterol and cardiovascular events. However, skepticism regarding their safety and value persists and is increasingly influenced by social media. TikTok has emerged as a major source of health information, but its content varies in quality and accuracy. This study evaluated the quality, attitudes, misinformation, and engagement of statin-related content on TikTok. Methods: Public TikTok videos were collected using predefined search terms and coded by creator type, thematic content, and overall attitude. Video quality was assessed using the DISCERN instrument, the Patient Education Materials Assessment Tool for Audiovisual Materials, and the Global Quality Score. False or misleading claims were independently reviewed by two cardiology fellows. Associations between engagement and quality were also examined. Results: Of 1,349 screened videos, 258 met inclusion criteria. Most were educational (91.0%), with non-physician healthcare providers (34.5%) as the largest creator group. Risks or negative effects were discussed more often than benefits (63.2% vs 42.2%), and 39.5% contained at least one false or misleading claim, most often from complementary and alternative medicine providers and wellness promoters. Quality differed by creator type across all instruments, with physician-created content scoring highest. Video popularity showed minimal association with informational quality. Conclusion: Statin-related TikTok content frequently emphasizes harms, often contains misinformation, and varies substantially in quality by creator type. Greater involvement of healthcare professionals on social media may help improve digital health literacy and counter misleading information about statin therapy.

14
An Explainable Multimodal AI Framework with Reinforcement Learning for Post-Surgical Clinical Decision Support

Ahmed, M.; Ahmed, F.; Mow, S. M.; Taha, P. A.; Barua, S.; Rahman, M. M.; Rafy, A.; Mondol, S. M.; Faisal, M. I.

2026-06-10 health informatics 10.64898/2026.06.08.26355217 medRxiv
Top 0.5%
3.3%

Post-surgical adverse outcomes, including mortality, intensive care readmission, and complications, remain major challenges for clinical decision-making. Existing machine learning approaches focus on outcome prediction while operating as opaque systems, limiting clinical trust and the translation of predictions into treatment decisions, and many clinical studies rely on synthetic data in which shared intermediate variables create circular dependencies between inputs and targets that compromise reported performance. We aimed to develop an explainable multimodal architecture and a rigorous evaluation methodology that address these gaps. We designed a two-stage architecture integrating supervised deep learning for risk prediction with conservative Q-learning for action recommendation. The first stage uses five modality-specific encoders for structured records, physiological time-series, chest radiographs, clinical notes, and surgical metadata, unified through cross-modal attention into a shared patient-state representation. The second stage applies offline reinforcement learning to recommend clinical actions while preventing value overestimation. We formally characterized a target-leakage flaw in synthetic pipelines and propose a real-data methodology using a verified clinical database, with event-censored temporal separation and uncertainty-weighted per-task training. Component-level behavior was validated on a controlled synthetic benchmark, demonstrating that the architecture functions as designed without claiming clinical validity. The cross-modal attention and risk-prediction components behaved as expected, whereas the offline reinforcement learning stage did not converge on the benchmark, indicating that value estimation requires further investigation on real clinical data. 
The architecture provides dual-level explainability through attention visualization and value decomposition, contributing a deployable design, a formal methodological critique of synthetic-data practices, and a complete framework for clinically valid evaluation.

15
A Three-Tier Operational Benchmark for Evaluating Large Language Models on Hospital Medication Safety

Proulx, J.; Daines, B.; Barton, M.; Leonard, M. E.; Garcia, J. A.; Young, B.; Snell, Q.; West, T. W.; Watson, S. R.; AlQaseer, M.; Louiset, M.; Maqsood, M. B.; Voutt-Goos, M. J.; Douma, C.; Kasbekar, N.; Jeffries, J.; Abu-Rahmeh, W.; Frush, K.; Grewal, D. K.; Bahsoun, M.; Leonard, M.; Frankel, A.; Classen, D. C.; Pestotnik, S. L.

2026-06-10 health informatics 10.64898/2026.06.05.26354271 medRxiv
Top 0.5%
3.1%

Objective. To introduce PsiBench, a clinically validated medication-safety benchmark for evaluating large language models (LLMs) against the standards used to certify hospital computerized provider order entry (CPOE) and electronic health record (EHR) systems, and a non-overlapping three-tier evaluation framework separating highest-stakes discrimination, the operational CDS regime, and category-correct alerting. Materials and Methods. PsiBench comprises 492 medication-safety scenarios across 11 safety categories, created by clinical pharmacology experts whose work underpins an annualized testing procedure used by more than 2,000 U.S. hospitals. The three-tier framework partitions the scenarios non-overlappingly: Discrimination (98 scenarios, 50 fatal vs 48 deception, near-balanced 51%/49%); Operational (394 scenarios, 261 serious unsafe plus 133 safe including 41 Excessive Alerts reclassified as operational negatives); and Attribution (311 alert-required scenarios). We evaluated 40 frontier LLMs from 10 providers over 3 runs per scenario at temperature 0.2 (or the provider default where temperature is not configurable), yielding 59,040 evaluations conducted April 21-23, 2026. Results. Headline binary performance on the full benchmark spans a wide range across the 40 models: F1 78.5%-92.3%, accuracy 65.4%-89.8%, sensitivity 81.4%-100.0%, specificity 6.1%-81.8%. Leading models by F1 (o4-mini 92.3%; o3 92.2%) pair high sensitivity with meaningful specificity; three models saturate sensitivity at 100% but fall below 25% specificity, indistinguishable from a naive always-alert classifier. The wide spread on a single headline metric motivates tier-specific analyses, developed in a separate clinical paper. Discussion and Conclusion. PsiBench and the three-tier framework operationalize a rigorous evaluation rubric for LLM medication safety, grounded in two decades of national hospital audit experience. 
The framework generalizes to any binary medication-safety classifier (rule-based, conventional ML, or LLM-driven), supporting tier-aware model selection and post-deployment surveillance.
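
The evaluation count and the always-alert comparison in this abstract can be checked with simple arithmetic. The following is a minimal sketch, not the authors' evaluation code: the helper `binary_metrics` is a hypothetical name, and treating the 261 serious unsafe scenarios of the Operational tier as positives and the 133 safe scenarios as negatives is an assumption made for illustration.

```python
def binary_metrics(tp, fp, fn, tn):
    """Return (sensitivity, specificity, F1) from confusion-matrix counts."""
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    denom = precision + sensitivity
    f1 = 2 * precision * sensitivity / denom if denom else 0.0
    return sensitivity, specificity, f1


# Counts stated in the abstract: scenarios x models x runs per scenario.
n_scenarios, n_models, n_runs = 492, 40, 3
total_evaluations = n_scenarios * n_models * n_runs

# Operational tier counts from the abstract; the positive/negative
# labelling (unsafe = positive, safe = negative) is an assumption.
unsafe, safe = 261, 133

# A naive classifier that alerts on every scenario catches every unsafe
# case (no false negatives) and flags every safe case (no true negatives).
sens, spec, f1 = binary_metrics(tp=unsafe, fp=safe, fn=0, tn=0)

print(total_evaluations, sens, spec, round(f1, 3))
```

Under these assumptions the always-alert classifier reaches full sensitivity with zero specificity, which is why a model at 100% sensitivity and low specificity adds little over that baseline.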

16
Development of Longitudinal, Linked Maternal-Infant Cohorts using the Epic Cosmos Electronic Health Record Dataset

Leonard, S. A.; Dysart, K.; Callahan, A.; Siadat, S.; Zhang, J.; Handley, S. C.; Huybrechts, K. F.; Igbinosa, I.; Bateman, B. T.

2026-06-04 epidemiology 10.64898/2026.06.02.26354757 medRxiv
Top 0.5%
2.8%

Background: Epic Cosmos is a relatively new centralized electronic health record dataset with high potential utility in perinatal epidemiologic research. Objectives: The study objectives were to develop replicable steps to create longitudinal, linked maternal-infant cohorts in Cosmos, assess completeness of key variables, evaluate potential selection bias with restrictions for longitudinal healthcare encounters, and provide an example epidemiologic analysis. Methods: We created maternal-infant cohorts by starting with live births during 2023-2024 recorded in the BirthFact data table and joining with additional data tables as needed. We selected and created variables for perinatal characteristics, common comorbidities, and routinely measured vital signs and laboratory values, and assessed variable completeness. We sequentially restricted the birth cohort for maternal-infant linkage and longitudinal healthcare from first-trimester prenatal care encounter through infant follow-up care within 12 weeks post-discharge from birth hospitalization. Finally, we conducted an example analysis of the association between high systolic blood pressure in the first trimester (≥140 mm Hg) and later onset of preeclampsia among those with chronic hypertension. Results: The total linked birth cohort included 2,624,186 pregnancies. Completeness was >90% for most variables assessed but was 77% for racial and ethnic group and 76% for body mass index at delivery. Characteristics of the cohort were similar to those reported for the entire United States birth population based on birth certificate data, including similar regional and racial-ethnic composition. Longitudinal cohort restriction requiring linked records from first trimester prenatal care through infant follow-up care reduced the cohort size to 509,148 pregnancies. However, restriction had minimal effects on cohort characteristics. 
In the example analysis, high systolic blood pressure was associated with increased risk of preeclampsia among those with chronic hypertension (aRR: 1.26; 95% CI: 1.22, 1.30). Conclusions: This study provides a rigorous and reproducible approach to creating longitudinal, linked maternal-infant cohorts in Epic Cosmos and the analytical findings suggest high data quality and representativeness.

17
Identifying Clinical Diagnostic Trajectories Associated With Suicide Death Using Temporal Sequence Mining of Linked Claims and Mortality Data

Belouali, A.; Kitchen, C.; Haroz, E.; Lehmann, H.; Nestadt, P. S.; Wilcox, H. C.; Kharrazi, H.

2026-06-10 health informatics 10.64898/2026.06.08.26355231 medRxiv
Top 0.6%
2.5%

Background: Most approaches to suicide risk assessment consider clinical conditions as independent risk factors, potentially overlooking prognostic information in the order in which conditions accumulate. We applied temporal sequence mining to linked claims and mortality data to identify ordered clinical diagnostic trajectories associated with suicide death. Results: The cohort included 3 647 059 insured Maryland residents aged 10 years or older with available claims records in the Maryland Suicide Data Warehouse from January 1, 2016, to December 31, 2020, among whom 768 suicide deaths were ascertained through medical examiner linkage. Sequential pattern mining of ICD-10-CM diagnoses grouped into Clinical Classifications Software Refined categories identified 89 221 candidate sequences, of which 1 816 remained significantly associated with suicide death in time-varying Cox models. Adjusted hazard ratios (AHRs) ranged from 2.4 to 134.1. Two-thirds of significant trajectories ended in physical conditions, and approximately half crossed from psychiatric to physical endpoints. Among suicide decedents, 62% were exposed to at least 1 significant sequence (median, 16 per case); median sequence duration was 18.7 months, and median time from completion to death was 13.1 months. In landmark analyses, among patients with depression who later developed suicidal ideation (n = 26 356), the path through anxiety, then anemia, was associated with higher risk (AHR, 4.6; 95% CI, 2.2-9.5), whereas the anxiety-only path was not (AHR, 1.3; 95% CI, 0.8-2.1). Among patients with anxiety who later developed hypertension (n = 149 215), the path through history of self-harm was associated with higher risk (AHR, 32.0; 95% CI, 16.6-61.6). Associations were generally consistent across sex and age. Conclusions: Temporal ordering of clinical conditions may carry prognostic information for suicide death. 
Clinical trajectories incorporating physical illness within psychiatric sequences identified higher-risk groups. These findings suggest that opportunities for risk detection may extend beyond psychiatric settings and that suicide risk signals may be fragmented across care settings and not apparent within isolated encounters.

18
Metatranscriptomics-Derived Disease Risk Scores as a Preventive, Diagnostic, and Treatment Support Tool

Hu, L.; Bass, M.; Patridge, E.; Molusky, M.; Antoine, G.; Vuyisich, M.; Banavar, G.

2026-06-06 genetic and genomic medicine 10.64898/2026.05.29.26354333 medRxiv
Top 0.7%
1.9%

Background: Chronic diseases and symptom syndromes often develop after prolonged biological changes that may precede formal diagnosis. RNA-based metatranscriptomics captures active microbial and human gene expression and may provide a functional layer for disease risk evaluation. To address this translational gap, we developed and validated a Disease Risk Score (DRS) framework that integrates metatranscriptome-derived pathway activity scores from stool, saliva, and blood samples, and evaluated its potential clinical utility as an adjunct risk-evaluation tool. Methods: DRS uses disease-specific sets of pathway activity scores derived from stool and saliva microbial functions, stool and saliva microbial taxa, and blood human gene expression. For each disease, 'not optimal' pathway scores are aggregated into a normalized cumulative odds ratio, or cOR, using score-level odds ratios, statistical significance, and literature-supported biological relevance derived from a Development Cohort of 22,369 individuals. A cOR ≥ 5 is defined as high risk. Performance is evaluated in an independent Validation Cohort of 15,908 individuals using self-reported diseases as the reference. Disease support requires both significant cOR separation between self-reported and not-reported (Cohen's d ≥ 0.2) and risk ratio enrichment of self-reported disease among individuals classified as high risk (95% CI of Risk Ratio > 1). Results: Of 20 initially evaluated diseases, 15 meet the prespecified validation criteria on the independent validation cohort: ADHD, anxiety, chronic fatigue syndrome, depression, GERD, hypertension, inflammatory bowel disease, IBS-C, IBS-D, insomnia, MASLD, obesity, obstructive sleep apnea, Sjogren's syndrome, and type 2 diabetes. 
Five selected clinical scenarios illustrate how DRS can support clinician-mediated decision making, including IBS subtype reclassification, improved diagnostic acceptance in IBS-D, personalized lifestyle counseling in MASLD and early type 2 diabetes, and diagnostic uncertainty in atypical GERD. Conclusions: DRS is a metatranscriptomics-based risk-stratification framework that aggregates active microbial and human pathway signals into interpretable disease-specific risk estimates across a wide range of disease conditions. Validation against self-reported disease labels in an independent cohort shows significant risk enrichment for each of 15 diseases. DRS is intended as an adjunct to clinical evaluation: a decision support tool in situations where routine care encounters uncertainty, delay, or low patient engagement. Future prospective studies using clinically adjudicated endpoints are needed to assess calibration and clinical outcomes.
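
The two-part validation criterion described in this abstract (Cohen's d of at least 0.2 and a risk ratio whose 95% CI lies above 1) can be sketched as follows. This is an illustration under stated assumptions, not the authors' pipeline: the function names are hypothetical, the pooled-SD form of Cohen's d and the log-scale (Katz) interval for the risk ratio are assumed choices, and all numbers in the usage lines are invented for demonstration.

```python
from math import exp, log, sqrt
from statistics import mean, variance


def cohens_d(a, b):
    """Cohen's d using the pooled sample standard deviation."""
    na, nb = len(a), len(b)
    pooled = sqrt(((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2))
    return (mean(a) - mean(b)) / pooled


def risk_ratio_ci(a, n1, c, n2, z=1.959964):
    """Risk ratio (a/n1)/(c/n2) with a log-scale (Katz) 95% interval."""
    rr = (a / n1) / (c / n2)
    se = sqrt(1 / a - 1 / n1 + 1 / c - 1 / n2)
    return rr, exp(log(rr) - z * se), exp(log(rr) + z * se)


def disease_supported(cor_cases, cor_controls, hi_cases, hi_total, lo_cases, lo_total):
    """Apply both criteria: d >= 0.2 and lower CI bound of the risk ratio > 1."""
    d = cohens_d(cor_cases, cor_controls)
    _, lower, _ = risk_ratio_ci(hi_cases, hi_total, lo_cases, lo_total)
    return d >= 0.2 and lower > 1.0


# Hypothetical data: cOR values for self-reported vs not-reported individuals,
# and self-reported disease counts in the high-risk (cOR >= 5) and other groups.
cor_cases = [6.1, 5.4, 7.2, 4.9, 6.8]
cor_controls = [3.2, 4.1, 2.8, 3.9, 3.5]
rr, lower, upper = risk_ratio_ci(40, 200, 60, 800)
supported = disease_supported(cor_cases, cor_controls, 40, 200, 60, 800)
print(round(rr, 2), supported)
```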

19
Low-Dose Aspirin Adherence Following Objective cell-free RNA-Based Preeclampsia Risk Testing: A Real-World Survey Study

Moe, A. B.; Haverty, C.; Lee, M.; Hahn, S. E.; McElrath, T. F.; Jain, M.; Rasmussen, M.; Corso, A.; Larson, M. L.; Morrison, H.; Melroy, L. M.; Roofeh, J.; Phelps-Sandall, B.; Kiefer, D.; Biggio, J. R.

2026-06-10 obstetrics and gynecology 10.64898/2026.06.08.26355195 medRxiv
Top 0.8%
1.7%

Introduction: Preeclampsia (PE) is a leading cause of maternal and neonatal morbidity and mortality, and low-dose aspirin (LDA) prophylaxis is the cornerstone of evidence-based prevention. Despite guideline recommendations, LDA adherence remains poor, with 10-25% of moderate-risk patients taking aspirin. Objective personalized risk stratification using biomarkers has been shown to motivate behavior change in other disease contexts. Survey data suggest that patients are more motivated to take aspirin if informed by an objective predictive test. Here, we report real-world LDA adherence among patients who received a high-risk result from a cell-free RNA (cfRNA) PE risk prediction test. Methods: This retrospective, observational survey study included asymptomatic patients of advanced maternal age (AMA; ≥35 years at delivery) with singleton pregnancies without USPSTF-defined preexisting high-risk conditions for PE who received the cfRNA PE risk prediction test. Patients who opted in to receive text message surveys were asked about LDA use following receipt of test results. High adherence was defined as reporting LDA use on at least 6 of 7 days per week at least 85% of the time surveyed. The primary analysis included patients with a high-risk test result and at least one LDA frequency survey response following receipt of test result. The observed proportion of adherent patients was compared to a baseline estimate of 25% using an exact binomial test. Results: Of 166 patients who received a cfRNA PE risk prediction test result, 48 (28.9%) received a high-risk result. Of these, 29 (60%) opted in and responded to at least one survey, constituting the primary analysis population. Twenty-seven of the 29 (93.1%; 95% CI: 78.0-98.1%) were classified as highly adherent, significantly higher than the 25% baseline adherence estimate for moderate-risk patients (p < 0.0001). 
Conclusion: Among surveyed patients who received a high-risk cfRNA PE test result, the proportion classified as highly adherent to LDA (93%) substantially exceeded published estimates of adherence in a similar patient population and met the clinically meaningful threshold of ≥80% associated with reduced risk of preterm preeclampsia. These findings indicate that objective and personalized biomarker risk testing may be a powerful driver of behavior change that current guidelines have failed to produce.
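
The exact binomial comparison reported in this abstract (27 of 29 adherent against a 25% baseline) can be reproduced with the standard library. This is a minimal sketch under assumptions: the abstract does not say whether the test was one-sided or which interval method was used, so the one-sided upper tail and the Wilson score interval below are illustrative choices, and the function names are hypothetical.

```python
from math import comb, sqrt


def binom_upper_tail(k, n, p):
    """Exact P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


def wilson_interval(k, n, z=1.959964):
    """Wilson score interval for a binomial proportion (95% by default)."""
    phat = k / n
    denom = 1 + z**2 / n
    centre = (phat + z**2 / (2 * n)) / denom
    half = z * sqrt(phat * (1 - phat) / n + z**2 / (4 * n**2)) / denom
    return centre - half, centre + half


# Counts and baseline taken from the abstract.
adherent, surveyed, baseline = 27, 29, 0.25

p_value = binom_upper_tail(adherent, surveyed, baseline)  # assumed one-sided
low, high = wilson_interval(adherent, surveyed)           # assumed interval method

print(f"one-sided p = {p_value:.3g}, 95% CI = {low:.3f} to {high:.3f}")
```

With these assumptions the p-value falls well below 0.0001 and the interval is close to the reported 78.0-98.1%, although that agreement does not confirm which method the authors used.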

20
The Multimodal Anonymizer: a fully local multi-agent AI system for medical data deidentification

Hirsch, A.; Ten, F. W.; Krueger, K. S.; Geyer, R.; Roeschl, T.; Groeschel, M.; Rostin, P.; Eils, R.; Spott, M.; Prasser, F.; Meyer, A.; Madrid, J.

2026-06-05 health informatics 10.64898/2026.05.28.26353952 medRxiv
Top 0.8%
1.7%

Background: Safe reuse of multimodal hospital data for AI development is limited by the absence of reliable, context-aware deidentification across multimodal data and longitudinal patient data. Existing approaches are largely modality-specific and can indiscriminately remove clinically important information. Methods: We developed the Multimodal Anonymizer, a modular, locally deployable multi-agent framework integrating multimodal large language models, task-specific neural networks and rule-based transformations. We evaluated 16 orchestrator model configurations on a benchmark built from publicly available data and hospital data from our institution. The benchmark dataset included data from different origins: 250 MIMIC-IV patients with synthetically injected personally identifiable information (PII) supplemented with head CT, face images, handwriting, audio, German clinical-text datasets and local data. Primary outcomes were deidentification sensitivity and preservation of clinically important content; secondary analyses examined model characteristics, reproducibility, and performance against leading market and open-source solutions. Results: The best local configuration (the orchestrator being Qwen3-VL-235B-A22B-Thinking) achieved near-complete deidentification across all datasets, with per-patient sensitivity of 98.80% (95%-CI 97.20; 100), and per-PII sensitivity of 99.82% (95%-CI 99.76; 99.88). Critical clinical preservation was 99.60% (95%-CI 98.80; 100) per-patient, and clinical preservation was 99.61% (95%-CI 99.51; 99.71) per-file. All modalities achieved at least 98.30% sensitivity (lower bound 95%-CI). On our local data, the system achieved a deidentification sensitivity of 100% per-patient and per-PII; and a critical clinical preservation of 100% per-patient as well as a clinical preservation of 99.97% (95%-CI 99.91; 100) per-file. 
When comparing orchestrators, the leading local models were similar to proprietary models (GPT-5.2) in deidentification sensitivity while showing higher deidentification specificity. The Multimodal Anonymizer outperformed previous tools on most modalities. Conclusion: Near-complete, utility-preserving deidentification of multimodal clinical data is achievable with a unified, locally deployable multi-agent system, enabling safer large-scale reuse of hospital data for research and AI development.